Executive Summary

SwiftKey develops a word prediction application that is used while typing on a mobile keyboard. When the user types “I went to the”, the application presents three options for what the next word might be. For example, the three words might be gym, store, and restaurant.

In this project, we use R to build a predictive model using text data provided by the Data Science Capstone course. The data consists of text from ‘Blogs’, ‘News’ and ‘Twitter’ totaling more than 4 million lines and ??? unique words.

TL;DR

In a nutshell, here’s a summary of the data analysis performed in this report.

  1. Load the raw data.
  2. Extract a 1% subsample of the data.
  3. Preprocess the data: remove stopwords, convert to lowercase, and apply the other cleaning steps listed under Text Cleaning.
  4. Generate 2-grams, 3-grams and 4-grams.
  5. Present a technique to use the Dirichlet-multinomial model as a language model.

Explore the Data

First, we sample 1% of the lines in each file to speed up the data exploration. The implementation is in sample_capstone_data in sample_data.R. We use the tm R package to load each sample file for analysis. The table below summarizes the sampled data per source.
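
The sampling step can be sketched in base R as follows. This is a minimal illustration, not the code in sample_data.R: the function name sample_lines, its parameters, and the temporary file are all hypothetical.

```r
# Hypothetical helper (not the report's sample_capstone_data):
# keep roughly `fraction` of the lines of a text file.
sample_lines <- function(path, fraction = 0.01, seed = 42) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  set.seed(seed)  # fixed seed so the subsample is reproducible
  # Independent coin flip per line, so the sample size is only approximately
  # fraction * number of lines.
  keep <- rbinom(length(lines), size = 1, prob = fraction) == 1
  lines[keep]
}

# Usage on a small temporary file standing in for a real data file
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:10000), tmp)
sampled <- sample_lines(tmp, fraction = 0.01)
length(sampled)  # close to 100; the exact value depends on the seed
```

A per-line coin flip avoids choosing an exact sample size up front, at the cost of a sample whose size varies slightly from run to run unless the seed is fixed.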

   source num_lines num_unique_words mean_word_freq median_word_freq
1 twitter     23602             8040             20                9
2   blogs      8993            12414             15                6
3    news     10103            12850             15                7

Text Cleaning

We perform the following text processing steps prior to parsing ngrams.

  • Remove all Punctuation
  • Remove all Numbers
  • Convert all words to Lowercase
  • Remove English Stopwords
  • Strip extra whitespace
  • Remove Profanity
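
The cleaning steps above can be sketched with base R string functions. This is an illustration only: the report's preprocess_entries may work differently, and the function clean_text and the tiny stopword and profanity lists below are hypothetical stand-ins for full lists.

```r
# Hypothetical stand-ins; a real run would use full stopword and profanity lists.
stopwords_en <- c("i", "to", "the", "a", "and", "of", "is")
profanity <- c("darn")

clean_text <- function(x, stop_list = stopwords_en, bad_words = profanity) {
  x <- tolower(x)                       # lowercase first so word lists match
  x <- gsub("[[:punct:]]+", " ", x)     # punctuation becomes a space
  x <- gsub("[[:digit:]]+", " ", x)     # numbers become a space
  # One whole-word pattern covering stopwords and profanity
  drop <- paste0("\\b(", paste(c(stop_list, bad_words), collapse = "|"), ")\\b")
  x <- gsub(drop, " ", x, perl = TRUE)
  x <- gsub("\\s+", " ", x, perl = TRUE)  # collapse runs of whitespace
  trimws(x)
}

clean_text("I went to the GYM 3 times, darn it!")
# [1] "went gym times it"
```

Lowercasing before removing words matters here, because the word lists are lowercase and “The” would otherwise survive the stopword filter.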

Word Frequencies

For example, here is the word frequency distribution for the sample data:

p <- all_docs_word_plot(sample_vector_corpus)
print(p)

NGram Frequencies

Let’s load all the data sources into one corpus.

docs <- load_sample_dircorpus()
docs <- preprocess_entries(docs)

Here are the top bigrams.

ngram_2 <- get_docterm_matrix(docs, 2)
p2 <- generate_word_frequency_plot(ngram_2$wf, "Top Bigrams for Sampled Text")
print(p2)

Here are the top trigrams.

ngram_3 <- get_docterm_matrix(docs, 3)
p3 <- generate_word_frequency_plot(ngram_3$wf, "Top Trigrams for Sampled Text")
print(p3)

Here are the top 4-grams.

ngram_4 <- get_docterm_matrix(docs, 4)
p4 <- generate_word_frequency_plot(ngram_4$wf, "Top 4-grams for Sampled Text")
print(p4)
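
To make the n-gram counts behind these plots concrete, here is a minimal base R sketch of n-gram extraction and counting. It is not the report's get_docterm_matrix; the helpers make_ngrams and count_ngrams and the toy lines are hypothetical.

```r
# Hypothetical helper: all n-grams of one cleaned line, as space-joined strings.
make_ngrams <- function(line, n) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]               # drop empty tokens
  if (length(tokens) < n) return(character(0))   # line too short for an n-gram
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over a character vector of lines, most frequent first.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

toy_lines <- c("went gym today", "went gym yesterday", "went store today")
bigram_counts <- count_ngrams(toy_lines, 2)
print(bigram_counts)  # "went gym" appears twice, every other bigram once
```

N-grams are built per line, so no n-gram spans the boundary between two unrelated lines.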

Word Prediction using NGrams

We build a tree from the ngrams and compute the maximum likelihood estimate (MLE) of each candidate next word using the Dirichlet-multinomial model. We use node.tree, which can build a tree from a data.frame. Now let’s perform a search for “data”.

Word Prediction for: ‘data’

Here are the maximum likelihood estimates. They show a roughly 6% likelihood that “entry” will be the next word: “data entry” has a frequency of 12 and “data” has a frequency of 198, so the maximum likelihood estimate is 12/198 ≈ 6.1%.

results <- perform_search(ngram_tree, c("data"))
print(results)
##                   12                   10                  
## recommended_words "entry"              "streams"           
## likelihood        "0.0606060606060606" "0.0505050505050505"
##                   8                    7                   
## recommended_words "recovery"           "dating"            
## likelihood        "0.0404040404040404" "0.0353535353535354"
##                   7                   
## recommended_words "personalize"       
## likelihood        "0.0353535353535354"
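
The likelihoods in this output can be reproduced directly from the counts. The sketch below also shows one common way a Dirichlet prior enters the estimate: with a symmetric prior of strength alpha over a vocabulary of V words, the predictive probability is (count + alpha) / (total + alpha * V), which reduces to the maximum likelihood estimate when alpha is 0. The report does not state a prior, so alpha, vocab_size, and the function next_word_prob are assumptions made for illustration.

```r
# Counts taken from the output above: frequency of each word following "data".
next_counts <- c(entry = 12, streams = 10, recovery = 8,
                 dating = 7, personalize = 7)
context_count <- 198  # frequency of "data"

# alpha and vocab_size are hypothetical parameters, not values from the report.
# alpha = 0 gives the plain maximum likelihood estimate count / total.
next_word_prob <- function(counts, total, alpha = 0, vocab_size = 1) {
  (counts + alpha) / (total + alpha * vocab_size)
}

mle <- next_word_prob(next_counts, context_count)
print(round(mle, 4))

# With a positive alpha, probability mass is shared with unseen words,
# so every observed word gets a smaller estimate than under the MLE.
smoothed <- next_word_prob(next_counts, context_count,
                           alpha = 0.5, vocab_size = 10000)
print(signif(smoothed, 3))
```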

Word Prediction for: ‘data entry’

If we then query for “data entry”, we walk the tree through the node “data” and then the node “entry”, and recommend the words “just” and “respond”. Each follows “data entry” 6 times out of 12, giving a likelihood of 0.5.

results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
##                   6      6        
## recommended_words "just" "respond"
## likelihood        "0.5"  "0.5"
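
The tree lookup can be illustrated with nested lists in base R. This is a sketch of the idea only, not the report's ngram_tree or perform_search: the helpers node and search_tree are hypothetical, and the toy tree holds just a few of the counts shown above.

```r
# Hypothetical toy tree: each node stores its count and a named list of children.
node <- function(count, children = list()) {
  list(count = count, children = children)
}

toy_tree <- node(0, list(
  data = node(198, list(
    entry   = node(12, list(just = node(6), respond = node(6))),
    streams = node(10)
  ))
))

# Walk the tree along `words`, then score each child of the final node by
# child count / parent count (the maximum likelihood estimate).
search_tree <- function(tree, words, top_n = 3) {
  current <- tree
  for (w in words) {
    current <- current$children[[w]]
    if (is.null(current)) return(NULL)  # context never seen
  }
  if (length(current$children) == 0) return(NULL)  # nothing follows this context
  counts <- vapply(current$children, function(ch) ch$count, numeric(1))
  head(sort(counts / current$count, decreasing = TRUE), top_n)
}

print(search_tree(toy_tree, c("data", "entry")))  # just and respond, 0.5 each
print(search_tree(toy_tree, "data"))              # entry 12/198, streams 10/198
```

Returning NULL for an unseen context leaves room for a fallback, such as retrying with a shorter context, which the report does not describe.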

Next Steps

  • Build a model using more than a 1% sample.
  • Deploy the ngram tree to the server side of a Shiny application.